Masked image modelling (e.g., Masked AutoEncoder) and contrastive learning (e.g., Momentum Contrast) have shown impressive performance in unsupervised visual representation learning. This work presents Masked Contrastive Representation Learning (MACRL) for self-supervised visual pre-training. In particular, MACRL leverages the effectiveness of both masked image modelling and contrastive learning. We adopt an asymmetric setting for the siamese network (i.e., an encoder-decoder structure in both branches), where one branch applies a higher mask ratio and stronger data augmentation, while the other adopts weaker data corruption. We optimize a contrastive learning objective on the features produced by the encoders of the two branches. In addition, we minimize an $L_1$ reconstruction loss on the decoders' outputs. In our experiments, MACRL achieves superior results on various vision benchmarks, including CIFAR-10, CIFAR-100, Tiny-ImageNet, and two other ImageNet subsets. Our framework provides unified insights for self-supervised visual pre-training and future research.
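The combined objective can be sketched as follows. This is a minimal, hypothetical PyTorch-style sketch, not the authors' implementation: the two branch modules (`strong_branch`, `weak_branch`), their mask ratios, the InfoNCE temperature, and the loss weighting are all illustrative assumptions; the abstract only specifies that a contrastive loss is applied to the encoder features and an $L_1$ loss to the decoder reconstructions.

```python
# Hypothetical sketch of the MACRL objective described in the abstract.
# Branch modules, mask ratios, temperature, and loss weights are assumptions.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Symmetric InfoNCE between the two branches' encoder features."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def macrl_loss(images, strong_branch, weak_branch, recon_weight=1.0):
    """Contrastive loss on encoder features plus L1 reconstruction loss
    on both decoders' outputs (branch APIs are hypothetical)."""
    # Strong branch: higher mask ratio and stronger augmentation (assumed 0.75).
    z_s, rec_s = strong_branch(images, mask_ratio=0.75, strong_aug=True)
    # Weak branch: lighter corruption (assumed 0.25).
    z_w, rec_w = weak_branch(images, mask_ratio=0.25, strong_aug=False)

    loss_contrast = info_nce(z_s, z_w)
    loss_recon = F.l1_loss(rec_s, images) + F.l1_loss(rec_w, images)
    return loss_contrast + recon_weight * loss_recon
```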
Interest in Artificial Intelligence (AI) and its applications has grown at an unprecedented rate over the past few years. This success can be attributed in part to advances in AI sub-fields such as machine learning, computer vision, and natural language processing. Much of the growth in these fields has been made possible by deep learning, a sub-area of machine learning that uses artificial neural networks. This has created significant interest in the integration of vision and language. In this survey, we focus on ten prominent tasks that integrate language and vision by discussing their problem formulations, methods, existing datasets, and evaluation measures, and by comparing the results obtained with the corresponding state-of-the-art methods. Our effort goes beyond earlier surveys, which are either task-specific or concentrate on only one type of visual content, i.e., images or videos. Furthermore, we provide some potential future directions in this field of research, with the expectation that this survey will stimulate innovative thoughts and ideas to address the existing challenges and to build new applications.